library(data.table) # Efficient Dataframe
library(lubridate) # For Dates
library(tidyverse) # Multiple Package for Useful Data wrangling
library(esquisse) # Intuitive plotting
library(plyr) # Data splitting
library(dplyr) # Data Wrangling
library(ggplot2) # Plot Graphs
library(naniar) # for NA exploration in Dataframe
library(plotly) # Make ggplot2 Dynamic
library(gridExtra) # Multiple Plot at once
library(RColorBrewer) # For Color Palette
library(rmdformats) # Theme of HTML
library(flextable) # Show Table
library(class) # K-NN
library(summarytools) # Beautiful and Efficient Summary for DatasetThose are required packages
If the company beings working with a new set of 1000 leads to sell the same services, similar to the 500 in the plot study, witout any use of predictive modeling to target sales efforts, what is the estimated profit?
Without any predictive modeling, we can roughly estimated the profit with the following formula:
\[ Sales_{Estimated} = 1000*\$2128 = \$2128000 \]
But the company will also have expenditures related to their sales, which would negatively impact the Total Profit. The sales effort would be:
\[ Costs_{Estimated} = 1000*\$2500 = \$2500000 \]
Leading to Total Profit of…
\[ TotalProfit_{Estimated} = Sales_{Estimated} - Costs_{Estimated} \\ = \$2128000 - \$2500000 = - \$372000 \]
If the firm wants the average profit on each sale to at least double the sales effort cost, and applies an appopriate cutoff with this predictive model to a new set of 1000 leads, how far down the new list of 1000 should it proceed (how many deciles)?
If we want to double the average profit on each sale, we should take the first decile (10%) on the Decile-wise lift chart which double the mean.
\[ Ratio_{Estimated} = \dfrac{2*\$2500}{\$2128} = 2.35 \]
Still considering the new list of 1000 leads, if the company applies this predictive model with a lower cutoff of $2500, how far should it proceed down the ranked leads, in terms of deciles?
We want the cutoff to be $2500:
\[ Ratio_{Estimated} = \dfrac{\$2500}{\$2128} = 1.17 \]
If we take a look at the Decile-wise lift chart, we see that until the 6th decile we would get a mean response of 1 for all those included decile until 6th one.
Why use this two-stage process for predicting sales–why not simply develop a model for predicting profit for the 1000 new leads?
This two-stage process for predicting sales already give possible answer to the managers for the 1000 leads predictions and which target customers they should choose for achieving such goals. Those 2 graphs help getting fast insights to the sales predictions and what to do for achieving the goals required by the board or company management.
## Data Frame Summary
## UniversalBank1
## Dimensions: 5000 x 14
## Duplicates: 0
##
## ---------------------------------------------------------------------------------------------------------------------------
## No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
## ---- -------------------- ------------------------------ ---------------------- ---------------------- ---------- ---------
## 1 ID Mean (sd) : 2500.5 (1443.5) 5000 distinct values : : : : : : : : : : 5000 0
## [integer] min < med < max: (Integer sequence) : : : : : : : : : : (100.0%) (0.0%)
## 1 < 2500.5 < 5000 : : : : : : : : : :
## IQR (CV) : 2499.5 (0.6) : : : : : : : : : :
## : : : : : : : : : :
##
## 2 Age Mean (sd) : 45.3 (11.5) 45 distinct values . : : 5000 0
## [integer] min < med < max: . : . : . : : . (100.0%) (0.0%)
## 23 < 45 < 67 : : : : : : : :
## IQR (CV) : 20 (0.3) : : : : : : : : : :
## : : : : : : : : : :
##
## 3 Experience Mean (sd) : 20.1 (11.5) 47 distinct values . . : : . 5000 0
## [integer] min < med < max: : : : : : : . : (100.0%) (0.0%)
## -3 < 20 < 43 : : : : : : : :
## IQR (CV) : 20 (0.6) . : : : : : : : : .
## : : : : : : : : : :
##
## 4 Income Mean (sd) : 73.8 (46) 162 distinct values : 5000 0
## [integer] min < med < max: . : : : (100.0%) (0.0%)
## 8 < 64 < 224 : : : :
## IQR (CV) : 59 (0.6) : : : : . .
## : : : : : : : : .
##
## 5 ZIP Code Mean (sd) : 93152.5 (2121.9) 467 distinct values : 5000 0
## [integer] min < med < max: : (100.0%) (0.0%)
## 9307 < 93437 < 96651 :
## IQR (CV) : 2697 (0) :
## :
##
## 6 Family Mean (sd) : 2.4 (1.1) 1 : 1472 (29.4%) IIIII 5000 0
## [integer] min < med < max: 2 : 1296 (25.9%) IIIII (100.0%) (0.0%)
## 1 < 2 < 4 3 : 1010 (20.2%) IIII
## IQR (CV) : 2 (0.5) 4 : 1222 (24.4%) IIII
##
## 7 CCAvg Mean (sd) : 1.9 (1.7) 108 distinct values : 5000 0
## [numeric] min < med < max: : . (100.0%) (0.0%)
## 0 < 1.5 < 10 : : .
## IQR (CV) : 1.8 (0.9) : : :
## : : : : . . .
##
## 8 Education Mean (sd) : 1.9 (0.8) 1 : 2096 (41.9%) IIIIIIII 5000 0
## [integer] min < med < max: 2 : 1403 (28.1%) IIIII (100.0%) (0.0%)
## 1 < 2 < 3 3 : 1501 (30.0%) IIIIII
## IQR (CV) : 2 (0.4)
##
## 9 Mortgage Mean (sd) : 56.5 (101.7) 347 distinct values : 5000 0
## [integer] min < med < max: : (100.0%) (0.0%)
## 0 < 0 < 635 :
## IQR (CV) : 101 (1.8) :
## : : . .
##
## 10 Personal Loan Min : 0 0 : 4520 (90.4%) IIIIIIIIIIIIIIIIII 5000 0
## [integer] Mean : 0.1 1 : 480 ( 9.6%) I (100.0%) (0.0%)
## Max : 1
##
## 11 Securities Account Min : 0 0 : 4478 (89.6%) IIIIIIIIIIIIIIIII 5000 0
## [integer] Mean : 0.1 1 : 522 (10.4%) II (100.0%) (0.0%)
## Max : 1
##
## 12 CD Account Min : 0 0 : 4698 (94.0%) IIIIIIIIIIIIIIIIII 5000 0
## [integer] Mean : 0.1 1 : 302 ( 6.0%) I (100.0%) (0.0%)
## Max : 1
##
## 13 Online Min : 0 0 : 2016 (40.3%) IIIIIIII 5000 0
## [integer] Mean : 0.6 1 : 2984 (59.7%) IIIIIIIIIII (100.0%) (0.0%)
## Max : 1
##
## 14 CreditCard Min : 0 0 : 3530 (70.6%) IIIIIIIIIIIIII 5000 0
## [integer] Mean : 0.3 1 : 1470 (29.4%) IIIII (100.0%) (0.0%)
## Max : 1
## ---------------------------------------------------------------------------------------------------------------------------
We can see that most of the variables are of type “integer” except CCAvg being of type “numeric”. We have no missing datas in all variables. We can see this better with a plot:
The Following Code Set Seed to 1 and partition the dataset in 2 sets, training and validation.
# Setting Seed
set.seed(1)
# Splitting Training and Validation
sample <- sample(c(TRUE, FALSE), nrow(UniversalBank1), replace=TRUE, prob=c(0.6,0.4))
training <- UniversalBank1[sample, ]
validation <- UniversalBank1[!sample, ]
# Checking if proportions are right
train_prop <- dim(training)
validation_prop <- dim(validation)
train_prop_100 <- (train_prop[1]/nrow(UniversalBank1))*100
validation_prop_100 <- (validation_prop[1]/nrow(UniversalBank1))*100
paste(train_prop_100,"% In Training",validation_prop_100,"% In Validation")[1] “61 % In Training 39 % In Validation”
Here is the confirmation of the effective pourcentages of each set category after the partition process.
Age = 40, Experience = 10, Income = 84, Family = 2, CCAvg = 2, Education_1 = 0, Education_2 = 1, Education_3 = 0, Mortgage = 0, Securities Account = 0, CD Account = 0, Online = 1, and Credit Card = 1.
Perform a Κ-NN Classification with all predictors except ID and ZIP code using Κ = 1
# Setting Seed
set.seed(1)
# Removing Some Predictors
training <- training[,-c("ID","ZIP Code")]
validation <- validation[,-c("ID","ZIP Code")]
# Target Variable As Factor
training$`Personal Loan` <- factor(training$`Personal Loan`, levels = c(0,1),labels = c("No Loan","Loan"))
validation$`Personal Loan` <- factor(validation$`Personal Loan`, levels = c(0,1),labels = c("No Loan","Loan"))
# Education As Factor
training$Education <- as.factor(training$Education)
validation$Education <- as.factor(validation$Education)
# Education One-Hot Encoding
Education_As_Dummy_Training <- model.matrix(~0+training$Education)
Education_As_Dummy_Validation <- model.matrix(~0+validation$Education)
# Append to Training and Validation Sets
training <- cbind(training,Education_As_Dummy_Training)
training <- training[,-c("Education")]
validation <- cbind(validation,Education_As_Dummy_Validation)
validation <- validation[,-c("Education")]
# Renaming Education
training = training %>% rename( Education_1 = `training$Education1` , Education_2 = `training$Education2`, Education_3 = `training$Education3`)
validation = validation %>% rename( Education_1 = `validation$Education1` , Education_2 = `validation$Education2`, Education_3 = `validation$Education3`)
# Preprocess for Data Normalization
library(caret)
training_norm <- training
validation_norm <- validation
training_norm_s <- training[,-c("Personal Loan")]
norm_values <- preProcess(training_norm_s,method = c("center","scale"))
training_norm <- predict(norm_values,training)
validation_norm <- predict(norm_values,validation)
# KNN Model using class package
library(class)
# Data frame for a specific customer not in Data
Customer_Test <- data.frame("Age"=40,"Experience"=10,"Income"=84,"Family"=2,"CCAvg"=2,"Mortgage"=0,"Securities Account"=0,"CD Account"=0,"Online"=1,"CreditCard"=1,"Education_1"=0,"Education_2"=1,"Education_3"=0, check.names=FALSE)
# Preprocess the Customer New Data
Customer_Test_norm <- predict(norm_values, Customer_Test)
## KNN Training for Customer
predictions_customer <- knn(train=training_norm[,-c("Personal Loan")],test = Customer_Test_norm, cl = training_norm$`Personal Loan`, k=1)
# Append Predictions to Customer not in Data
Customer_Test$Predicted <- predictions_customer# Table Customer after Normalization
flextable(head(Customer_Test_norm)) %>%
fontsize(size = 7, part = "all")Age | Experience | Income | Family | CCAvg | Mortgage | Securities Account | CD Account | Online | CreditCard | Education_1 | Education_2 | Education_3 |
-0.4566169 | -0.8659041 | 0.199656 | -0.3477529 | 0.01439031 | -0.5640238 | -0.3523862 | -0.264863 | 0.8253143 | 1.51897 | -0.8658006 | 1.643341 | -0.6565908 |
# Table Customer after Predictions
flextable(head(Customer_Test)) %>%
fontsize(size = 7, part = "all")Age | Experience | Income | Family | CCAvg | Mortgage | Securities Account | CD Account | Online | CreditCard | Education_1 | Education_2 | Education_3 | Predicted |
40 | 10 | 84 | 2 | 2 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | No Loan |
This Customer would be classified as not getting a Personnal Loan (No Loan) by K-NN with K=1.
We would to test multiple K such that the best accuracy would be chosen between the training and validations set (cross validation procedure)
# Setting Seed
set.seed(1)
# Load Caret
library(caret)
# Number of iterations
max_iterations = 30
# Dataframe with 2 columns: k and accuracy
accuracy.df <- data.frame(k=seq(1,max_iterations,1),accuracy=rep(0,max_iterations))
# Compute K-NN for different k on validation
for(i in 1:max_iterations){
# Testing K-NN
knn.prediction <- knn(train = training_norm[,-c("Personal Loan")], test=validation_norm[,-c("Personal Loan")] , cl=training_norm$`Personal Loan`, k=i)
# Storing into the accuracy.df results
accuracy.df[i,2] <- confusionMatrix(knn.prediction, validation$`Personal Loan`)$overall[1]
}
# Table of Accuracy
flextable(accuracy.df) %>% fontsize(size = 12, part = "all")k | accuracy |
1 | 0.9676923 |
2 | 0.9574359 |
3 | 0.9641026 |
4 | 0.9574359 |
5 | 0.9564103 |
6 | 0.9533333 |
7 | 0.9538462 |
8 | 0.9507692 |
9 | 0.9507692 |
10 | 0.9476923 |
11 | 0.9487179 |
12 | 0.9471795 |
13 | 0.9466667 |
14 | 0.9451282 |
15 | 0.9456410 |
16 | 0.9446154 |
17 | 0.9425641 |
18 | 0.9415385 |
19 | 0.9415385 |
20 | 0.9425641 |
21 | 0.9420513 |
22 | 0.9415385 |
23 | 0.9400000 |
24 | 0.9394872 |
25 | 0.9405128 |
26 | 0.9389744 |
27 | 0.9389744 |
28 | 0.9358974 |
29 | 0.9369231 |
30 | 0.9374359 |
# Ploting the K and accuracy together
ggplotly(
ggplot(accuracy.df) +
aes(x = k, y = accuracy) +
geom_line(size = 0.5, colour = "#1c6155") +
labs(x = "Number of K",
y = "Accuracy (Between Training and Validation)", title = "K-NN Accuracy regarding parameter K") +
theme_minimal()
)# Choosing Efficient K
highest_K <- which.max(accuracy.df$accuracy)
print(paste("Best K for Highest Accuracy is",highest_K))[1] “Best K for Highest Accuracy is 1”
We can see that the highest the K, the less is the accuracy of the model is through cross validation.
# Setting Seed
set.seed(1)
# Computing Confusion Matrix with Best K
predictions_k <- knn(train=training_norm[,-c("Personal Loan")],test = validation_norm[,-c("Personal Loan")], cl = training_norm$`Personal Loan`, highest_K)
# Confusion Matrix
Confusion_Matrix_k <- confusionMatrix(data = predictions_k, reference = validation$`Personal Loan`)
# Plotting Matrix Function
draw_confusion_matrix <- function(cm) {
layout(matrix(c(1,1,2)))
par(mar=c(2,2,2,2))
plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
title('CONFUSION MATRIX', cex.main=2)
# create the matrix
rect(150, 430, 240, 370, col='#1c6155')
text(195, 435, 'No Loan', cex=1.2)
rect(250, 430, 340, 370, col='#1c615570')
text(295, 435, 'Loan', cex=1.2)
text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
text(245, 450, 'Actual', cex=1.3, font=2)
rect(150, 305, 240, 365, col='#1c615570')
rect(250, 305, 340, 365, col='#1c6155')
text(140, 400, 'No Loan', cex=1.2, srt=90)
text(140, 335, 'Loan', cex=1.2, srt=90)
# add in the cm results
res <- as.numeric(cm$table)
text(195, 400, res[1], cex=1.6, font=2, col='white')
text(195, 335, res[2], cex=1.6, font=2, col='white')
text(295, 400, res[3], cex=1.6, font=2, col='white')
text(295, 335, res[4], cex=1.6, font=2, col='white')
# add in the specifics
plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
text(10, 85, names(cm$byClass[1]), cex=1.2, font=2)
text(10, 70, round(as.numeric(cm$byClass[1]), 3), cex=1.2)
text(30, 85, names(cm$byClass[2]), cex=1.2, font=2)
text(30, 70, round(as.numeric(cm$byClass[2]), 3), cex=1.2)
text(50, 85, names(cm$byClass[5]), cex=1.2, font=2)
text(50, 70, round(as.numeric(cm$byClass[5]), 3), cex=1.2)
text(70, 85, names(cm$byClass[6]), cex=1.2, font=2)
text(70, 70, round(as.numeric(cm$byClass[6]), 3), cex=1.2)
text(90, 85, names(cm$byClass[7]), cex=1.2, font=2)
text(90, 70, round(as.numeric(cm$byClass[7]), 3), cex=1.2)
# add in the accuracy information
text(30, 35, names(cm$overall[1]), cex=1.5, font=2)
text(30, 20, round(as.numeric(cm$overall[1]), 3), cex=1.4)
text(70, 35, names(cm$overall[2]), cex=1.5, font=2)
text(70, 20, round(as.numeric(cm$overall[2]), 3), cex=1.4)
}
#Plot the Confusion Matrix
draw_confusion_matrix(Confusion_Matrix_k)This Confusion Matrix has a Accuracy of 0.968 and Specifity of 0.738, lower than the Sensitivity. Kappa is equal to 0.793. Among all Confusion Matrix, this one has the highest Specifity and Kappa. This Confusion Matrix has the highest F1 but really close to the last Confusion Matrix.
Age = 40, Experience = 10, Income = 84, Family = 2, CCAvg = 2, Education_1 = 0, Education_2 = 1, Education_3 = 0, Mortgage = 0, Securities Account = 0, CD Account = 0, Online = 1, and Credit Card = 1.
# Setting Seed
set.seed(1)
# KNN Model on a specific customer not in Data
Customer_Test_2 <- data.frame("Age"=40,"Experience"=10,"Income"=84,"Family"=2,"CCAvg"=2,"Mortgage"=0,"Securities Account"=0,"CD Account"=0,"Online"=1,"CreditCard"=1,"Education_1"=0,"Education_2"=1,"Education_3"=0, check.names=FALSE)
# Preprocess the Customer New Data
Customer_Test_2_norm <- predict(norm_values, Customer_Test_2)
## KNN Training for Customer
predictions_customer_2 <- knn(train=training_norm[,-c("Personal Loan")],test = Customer_Test_2_norm, cl = training_norm$`Personal Loan`, k=highest_K)
# Append Predictions to Customer not in Data
Customer_Test_2$Predicted <- predictions_customer_2# Table Customer after Predictions
flextable(head(Customer_Test_2)) %>%
fontsize(size = 7, part = "all")Age | Experience | Income | Family | CCAvg | Mortgage | Securities Account | CD Account | Online | CreditCard | Education_1 | Education_2 | Education_3 | Predicted |
40 | 10 | 84 | 2 | 2 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | No Loan |
# Setting Seed
set.seed(1)
# Splitting Training and Validation and Test
splitting <- sample(1:3,size=nrow(UniversalBank1),replace=TRUE,prob=c(0.5,0.3,0.2))
train <- UniversalBank1[splitting==1,]
valid <- UniversalBank1[splitting==2,]
test <- UniversalBank1[splitting==3,]
# Checking if proportions are right
Prop_train <- (nrow(train)/nrow(UniversalBank1))*100
Prop_valid <- (nrow(valid)/nrow(UniversalBank1))*100
Prop_test <- (nrow(test)/nrow(UniversalBank1))*100
# Print Proportion
paste(Prop_train,"% In Training",Prop_valid,"% In Validation",Prop_test,"% In Test")## [1] "51.28 % In Training 27.74 % In Validation 20.98 % In Test"
# Setting Seed
set.seed(1)
# Removing Some Predictors
train <- train[,-c("ID","ZIP Code")]
valid <- valid[,-c("ID","ZIP Code")]
test <- test[,-c("ID","ZIP Code")]
# Target Variable As Factor
train$`Personal Loan` <- factor(train$`Personal Loan`, levels = c(0,1), labels=c("No Loan","Loan"))
valid$`Personal Loan` <- factor(valid$`Personal Loan`,levels = c(0,1), labels=c("No Loan","Loan"))
test$`Personal Loan` <- factor(test$`Personal Loan`,levels = c(0,1), labels=c("No Loan","Loan"))
# Education As Factor
train$Education <- as.factor(train$Education)
valid$Education <- as.factor(valid$Education)
test$Education <- as.factor(test$Education)
# Education One-Hot Encoding
Education_As_Dummy_Train <- model.matrix(~0+train$Education)
Education_As_Dummy_Valid <- model.matrix(~0+valid$Education)
Education_As_Dummy_Test <- model.matrix(~0+test$Education)
# Append to Training and Validation Sets
train <- cbind(train,Education_As_Dummy_Train)
train <- train[,-c("Education")]
valid <- cbind(valid,Education_As_Dummy_Valid)
valid <- valid[,-c("Education")]
test <- cbind(test,Education_As_Dummy_Test)
test <- test[,-c("Education")]
# Renaming Education
train = train %>% rename( Education_1 = `train$Education1` , Education_2 = `train$Education2`, Education_3 = `train$Education3`)
valid = valid %>% rename( Education_1 = `valid$Education1` , Education_2 = `valid$Education2`, Education_3 = `valid$Education3`)
test = test %>% rename( Education_1 = `test$Education1` , Education_2 = `test$Education2`, Education_3 = `test$Education3`)
# Preprocess for Data Normalization
library(caret)
train_norm <- train
validn_norm <- valid
test_norm <- test
train_norm_s <- train[,-c("Personal Loan")]
norm_values_2 <- preProcess(train_norm_s,method = c("center","scale"))
train_norm <- predict(norm_values_2,train)
valid_norm <- predict(norm_values_2,valid)
test_norm <- predict(norm_values_2,test)Confusion Matrix for Train VS Valid
# Train VS Valid
# Setting Seed
set.seed(1)
# Computing Confusion Matrix with Best K
predictions_k_new <- knn(train=train_norm[,-c("Personal Loan")],test = valid_norm[,-c("Personal Loan")], cl = train_norm$`Personal Loan`, highest_K)
# As Factor Predictions
predictions_k_new <- as.factor(predictions_k_new)
# Confusion Matrix
Confusion_Matrix_k_New <- confusionMatrix(data = predictions_k_new, reference = valid$`Personal Loan`)
#Plot the Confusion Matrix
draw_confusion_matrix(Confusion_Matrix_k_New)Specificity is 0.669 which is lower than the Sensitivity (0.99), and kappa of 0.735 with accuracy of 0.96.
Confusion Matrix for Train VS Test
# Train VS Test
# Setting Seed
set.seed(1)
# Computing Confusion Matrix with Best K
predictions_k_new2 <- knn(train=train_norm[,-c("Personal Loan")],test = test_norm[,-c("Personal Loan")], cl = train_norm$`Personal Loan`, highest_K)
# As Factor Predictions
predictions_k_new2 <- as.factor(predictions_k_new2)
# Confusion Matrix
Confusion_Matrix_k_New2 <- confusionMatrix(data = predictions_k_new2, reference = test$`Personal Loan`)
#Plot the Confusion Matrix
draw_confusion_matrix(Confusion_Matrix_k_New2)The Specificity is higher (0.716) than the previous Confusion Matrix (0.669 - Train VS Valid), but still lower than the Sensitivity. The Accuracy is a bit higher 0.965 > 0.96, but Kappa is higher with this Confusion Matrix (0.767>0.735).
UniversalBank2 <- fread("DATA/UniversalBank.csv")
# Setting Seed
set.seed(1)
# Splitting Training and Validation
sample2 <- sample(c(TRUE, FALSE), nrow(UniversalBank2), replace=TRUE, prob=c(0.6,0.4))
training_8 <- UniversalBank2[sample2, ]
validation_8 <- UniversalBank2[!sample2, ]
# Checking if proportions are right
training_8_prop <- (nrow(training_8)/nrow(UniversalBank2))*100
validation_8_prop <- (nrow(validation_8)/nrow(UniversalBank2))*100
paste(training_8_prop,"% In Training",validation_8_prop,"% In Validation")## [1] "61 % In Training 39 % In Validation"
library(pivottabler)
# Duplicata of Training Data for Pivot Data
pivot_data <- training_8
# As Factor
pivot_data$Online <- factor(pivot_data$Online,levels = c(0,1),labels=c("Inactive Online","Active Online"))
pivot_data$CreditCard <- factor(pivot_data$CreditCard,levels = c(0,1),labels=c("No Credit Card","Credit Card"))
pivot_data$`Personal Loan` <- factor(pivot_data$`Personal Loan`,levels = c(0,1),labels=c("No Personal Loan","Personal Loan"))
# Pivot Table
pt <- PivotTable$new()
pt$addData(pivot_data)
pt$addColumnDataGroups("Online")
pt$addRowDataGroups("CreditCard")
pt$addRowDataGroups("Personal Loan")
pt$defineCalculation(calculationName="Total", summariseExpression="n()")
pt$renderPivot()Using Bayes Theorem
\[\small P(Loan=1 | CC=1 \cap Online=1) = \\ \small \dfrac{54}{506+54} =\dfrac{54}{560} = 0.09642857 = 9.64 \%\]
Thus, there is 9.64% probability that this kind of customer would accept the loan offer.
library(pivottabler)
# Pivot Table 1
pt1 <- PivotTable$new()
pt1$addData(pivot_data)
pt1$addColumnDataGroups("Online")
pt1$addRowDataGroups("Personal Loan")
pt1$defineCalculation(calculationName="Total", summariseExpression="n()")
pt1$renderPivot()# Pivot Table 1
pt2 <- PivotTable$new()
pt2$addData(pivot_data)
pt2$addColumnDataGroups("CreditCard")
pt2$addRowDataGroups("Personal Loan")
pt2$defineCalculation(calculationName="Total", summariseExpression="n()")
pt2$renderPivot()\[\small P (CC=1 | Loan=1) = \\ \small \dfrac{95}{297} = 0.3198653 = 31.99 \%\]
\[\small P (Online=1 | Loan=1) =\\ \small \dfrac{188}{297}= 0.6329966 = 63.30 \%\]
\[\small P (Loan=1) = \dfrac{297}{3050} = 0.09737705 = 9.74 \% \]
\[\small P (CC=1 | Loan=0) = \\ \small \dfrac{827}{2753} = 0.3003996 = 30.04 \%\]
\[\small P (Online=1 | Loan=0) = \\ \small \dfrac{1626}{2753} = 0.5906284 = 59.06 \%\]
\[\small P (Loan=0) = \dfrac{2753}{3050} = 0.902623 = 90.26\%\]
\[\small P(Loan=1|CC=1,Online=1)\] Using the naive Bayes Probability give us the following computation:
\[\small P(Loan=1|CC=1,Online=1) = \\ \small \dfrac{P(Loan=1)*P(CC=1|Loan=1)*P(Online=1|Loan=1)}{P(CC=1)*P(Online=1)}=\\ \small \dfrac{0.09737705*0.3198653*0.6329966}{0.3022951*0.5947541} = 0.1096621 = 10.97\% \]
The Naive approach give us 10.97%, while the Exact Bayes probability give us 9.64%
\[P(Loan=1|CC=1,Online=1)\]
library(naivebayes)
# As factor for Loan
training_8$Online <- factor(training_8$Online,levels = c(0,1),labels=c("Inactive Online","Active Online"))
training_8$CreditCard <- factor(training_8$CreditCard,levels = c(0,1),labels=c("No Credit Card","Credit Card"))
training_8$`Personal Loan` <- factor(training_8$`Personal Loan`,levels = c(0,1),labels=c("No Personal Loan","Personal Loan"))
Naivebayes <- naive_bayes(training_8$`Personal Loan` ~ training_8$CreditCard + training_8$Online, data=training_8)
summary(Naivebayes)##
## ================================== Naive Bayes ==================================
##
## - Call: naive_bayes.formula(formula = training_8$`Personal Loan` ~ training_8$CreditCard + training_8$Online, data = training_8)
## - Laplace: 0
## - Classes: 2
## - Samples: 3050
## - Features: 2
## - Conditional distributions:
## - Bernoulli: 2
## - Prior probabilities:
## - No Personal Loan: 0.9026
## - Personal Loan: 0.0974
##
## ---------------------------------------------------------------------------------
Naivebayes##
## ================================== Naive Bayes ==================================
##
## Call:
## naive_bayes.formula(formula = training_8$`Personal Loan` ~ training_8$CreditCard +
## training_8$Online, data = training_8)
##
## ---------------------------------------------------------------------------------
##
## Laplace smoothing: 0
##
## ---------------------------------------------------------------------------------
##
## A priori probabilities:
##
## No Personal Loan Personal Loan
## 0.90262295 0.09737705
##
## ---------------------------------------------------------------------------------
##
## Tables:
##
## ---------------------------------------------------------------------------------
## ::: training_8$CreditCard (Bernoulli)
## ---------------------------------------------------------------------------------
##
## training_8$CreditCard No Personal Loan Personal Loan
## No Credit Card 0.6996004 0.6801347
## Credit Card 0.3003996 0.3198653
##
## ---------------------------------------------------------------------------------
## ::: training_8$Online (Bernoulli)
## ---------------------------------------------------------------------------------
##
## training_8$Online No Personal Loan Personal Loan
## Inactive Online 0.4093716 0.3670034
## Active Online 0.5906284 0.6329966
##
## ---------------------------------------------------------------------------------
\[\small P(Loan=1|CC=1,Online=1) = \\ \small \dfrac{P(Loan=1)*P(CC=1|Loan=1)*P(Online=1|Loan=1)}{P(CC=1)*P(Online=1)}=\\ \small \dfrac{0.09737705*0.3198653*0.6329966}{0.3022951*0.5947541} = 0.1096621 = 10.97\% \]